Language Identification in Document Images
نویسندگان
چکیده
This paper presents a system dedicated to automatic language identification of text regions in heterogeneous and complex documents. This system is able to process documents with mixed printed and handwritten text and various layouts. To handle such a problem, we propose a system that performs the following sub-tasks: writing type identification (printed/handwritten), script identification and language identification. The methods for the writing type recognition and the script discrimination are based on the analysis of the connected components while the language identification approach relies on a statistical text analysis, which requires a recognition engine. We evaluate the system on a new public dataset and present detailed results on the three tasks. Our system outperforms the Google plug-in evaluated on the ground-truth transcriptions of the same dataset.
منابع مشابه
Script and Language Identification in Degraded and Distorted Document Images
This paper reports a statistical identification technique that differentiates scripts and languages in degraded and distorted document images. We identify scripts and languages through document vectorization, which transforms each document image into an electronic document vector that characterizes the shape and frequency of the contained character and word images. We first identify scripts bas...
متن کاملScript Identification for Document Image Retrieval: A Survey
In recent years there are many multimedia documents captured and stored with the advances in computer technology and hence the demand for recognizing and retrieval of such documents has increased tremendously .In such environment the large volume of data and variety of scripts make manual identification unworkable. In such cases the ability to automatically determine the script ,and further the...
متن کاملLanguage Identification in Degraded and Distorted Document Images
This paper presents a language identification technique that differentiates Latin-based languages in degraded and distorted document images. Different from the reported methods that transform word images through a character shape coding process, our method directly captures word shapes with the local extremum points and the horizontal intersection numbers, which are both tolerant of noise, char...
متن کاملScript and Language Identification for Document Images and Scene Texts
In recent times, there have been an increase in Optical Character Recognition (OCR) solutions for recognizing the text from scanned document images and scene-texts taken with the mobile devices. Many of these solutions works very good for individual script or language. But in multilingual environment such as in India, where a document image or scene-images may contain more than one language, th...
متن کاملLanguage identification for handwritten document images using a shape codebook
Language identification for handwritten document images is an open document analysis problem. In this paper, we propose a novel approach to language identification for documents containing mixture of handwritten and machine printed text using image descriptors constructed from a codebook of shape features. We encode local text structures using scale and rotation invariant codewords, each repres...
متن کاملInformation Extraction from Symbolically Compressed Document Images
The extraction of information from symbolically compressed document images is an increasingly important problem as the related standard (JBIG2) and commercial products become available. Symbolic compression techniques work by clustering individual connected connected components (blobs) in a document image and storing the sequence of occurrence of blobs and representative blob templates, hence t...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2016